{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import sys\n",
    "if \"../\" not in sys.path:\n",
    "  sys.path.append(\"../\") \n",
    "from lib.utils import read_data_from_file, sign"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question 16-18"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "16. For Questions 16-20, you will play with the decision stump algorithm.<br/><br/>In class, we taught about the learning model of \"positive and negative rays'' (which is simply one-dimensional perceptron) for one-dimensional data. The model contains hypotheses of the form:$$h_{s,\\theta}(x)=s⋅sign(x−\\theta).$$The model is frequently named the \"decision stump'' model and is one of the simplest learning models. As shown in class, for one-dimensional data, the VC dimension of the decision stump model is 2.<br/><br/>In fact, the decision stump model is one of the few models that we could easily minimize $E_{in}$ efficiently by enumerating all possible thresholds. In particular, for N examples, there are at most 2N dichotomies (see page 2 of lecture 5 slides), and thus at most 2N different $E_{in}$ values. We can then easily choose the dichotomy that leads to the lowest $E_{in}$, where ties an be broken by randomly choosing among the lowest $E_{in}$ ones. The chosen dichotomy stands for a combination of some \"spot\" (range of $\\theta$) and s, and commonly the median of the range is chosen as the $\\theta$ that realizes the dichotomy.<br/><br/>In this problem, you are asked to implement such and algorithm and run your program on an artificial data set. First of all, start by generating a one-dimensional data by the procedure below:\n",
    " - Generate x by a uniform distribution in $[-1,1]$\n",
    " - Generate y by $f(x)=\\bar{s}(x)$ + noise where $\\bar{s}(x)=sign(x)$ and the noise flips the result with $20\\%$ probability.\n",
    " \n",
    " For any decision stump $h_{s, \\theta}$ with $\\theta\\in[−1,1]$, express $E_{out}(h_{s, \\theta})$ as a function of $\\theta$ and $s$.\n",
    " 1. $0.3+0.5s(|\\theta|-1)$\n",
    " 2. $0.3+0.5s(1-|\\theta|)$\n",
    " 3. $0.5+0.3s(|\\theta|-1)$\n",
    " 4. $0.5+0.3s(1-|\\theta|)$\n",
    " 5. none of the other choices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sol:$\\lambda=\\frac{4}{5}, \\mu=\\frac{1+s}{2}\\frac{|\\theta|}{2}+\\frac{1-s}{2}\\frac{2-|\\theta|}{2}=\\frac{s|\\theta|-s+1}{2}\\\\ \\implies \n",
    "\\begin{eqnarray*}\n",
    "E_{out}(h_{s, \\theta})&=&\\lambda\\mu+(1-\\lambda)(1-\\mu) \\\\\n",
    "&=&0.6\\mu+0.2 \\\\\n",
    "&=&0.5+0.3s(|\\theta|-1)\n",
    "\\end{eqnarray*}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.542641286533492 1]\n",
      " [-0.958496101281197 1]\n",
      " [0.26729646985255084 1]\n",
      " [0.4976077650772237 1]\n",
      " [-0.0029859753948191514 -1]\n",
      " [-0.5504067089383047 -1]\n",
      " [-0.603874270480752 -1]\n",
      " [0.5210614243979175 1]\n",
      " [-0.6617783268749291 -1]\n",
      " [-0.8233203716519795 -1]\n",
      " [0.3707196367355945 1]\n",
      " [0.9067866923898731 1]\n",
      " [-0.9921034673441711 -1]\n",
      " [0.024384526771553228 1]\n",
      " [0.625241923304227 -1]\n",
      " [0.2250521336587763 1]\n",
      " [0.4435106348635991 -1]\n",
      " [-0.4162478636587337 -1]\n",
      " [0.8355482450258869 -1]\n",
      " [0.42915156679538113 1]]\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(10)\n",
    "X = np.random.uniform(-1, 1, 20)\n",
    "y = sign(X) * sign(np.random.uniform(-0.2, .8, 20))\n",
    "\n",
    "print(np.vstack((X,y)).T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decision_stump_1d(X, y):\n",
    "    \"\"\"\n",
    "    One-Dimensional Decision Stump Algorithm\n",
    "    Args:\n",
    "        X: 数据\n",
    "        y: 标签\n",
    "    Returns:\n",
    "        s: 符号\n",
    "        theta: threshhold\n",
    "        err_in: ...\n",
    "    \"\"\"\n",
    "    h = lambda s, theta: s * sign(X - theta)\n",
    "    # sort data\n",
    "    indices = np.argsort(X)\n",
    "    X = X[indices]\n",
    "    y = y[indices]\n",
    "    \n",
    "    # cal err_in\n",
    "    thetas = (np.concatenate((X, X[-1:] + 1)) + np.concatenate((X[:1]-1, X)))/2\n",
    "    all_err_in = [(h(s,theta)!=y).mean() for s in [-1, 1] for theta in thetas]\n",
    "    \n",
    "    # find best s, theta\n",
    "    index = np.argmin(all_err_in)\n",
    "    s = [-1, 1][index//len(thetas)]\n",
    "    theta = thetas[index%len(thetas)]\n",
    "    err_in = all_err_in[index]\n",
    "    \n",
    "    return s, theta, err_in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "e_out = lambda s, theta: .5 + .3 * s * (np.abs(theta) - 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "17. Generate a data set of size 20 by the procedure above and run the one-dimensional decision stump algorithm on the data set. Record $E_{in}$ and compute $E_{out}$ with the formula above. Repeat the experiment (including data generation, running the decision stump algorithm, and computing $E_{in}$ and $E_{out}$) $5,000$ times. What is the average $E_{in}$?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "18. Continuing from the previous question, what is the average $E_{out}$ out?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "average Ein: 0.1702\taverage Eout: 0.25734444036057114\n"
     ]
    }
   ],
   "source": [
    "np.random.seed(0)\n",
    "e_ins = []\n",
    "e_outs = []\n",
    "for i in range(5000):\n",
    "    # Generate data\n",
    "    X = np.random.uniform(-1, 1, 20)\n",
    "    y = sign(X) * sign(np.random.uniform(-0.2, .8, 20))\n",
    "    \n",
    "    # run the one-dimensional decision stump algorithm\n",
    "    s, theta, err_in = decision_stump_1d(X, y)\n",
    "    \n",
    "    # cal Eout\n",
    "    err_out = e_out(s, theta)\n",
    "    \n",
    "    e_ins.append(err_in)\n",
    "    e_outs.append(err_out)\n",
    "print('average Ein: {}\\taverage Eout: {}'.format(np.mean(e_ins), np.mean(e_outs)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question 19-20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "19. Decision stumps can also work for multi-dimensional data. In particular, each decision stump now deals with a specific dimension $i$, as shown below.$$h_{s,i,\\theta}(x)=s⋅sign(x_{i}−\\theta).$$Implement the following decision stump algorithm for multi-dimensional data:<br/><br/>a) for each dimension $i=1,2,⋯,d$, find the best decision stump $h_{s,i,\\theta}$ using the one-dimensional decision stump algorithm that you have just implemented.<br/><br/>b) return the \"best of best\"' decision stump in terms of $E_{in}$. If there is a tie , please randomly choose among the lowest-$E_{in}$ones<br/><br/>The training data $D_{train}$ is available at:<br/><br/>https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_math/hw2_train.dat<br/><br/>The testing data $D_{test}$ is available at:<br/><br/>https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_math/hw2_test.dat<br/><br/>Run the algorithm on the $D_{train}$. Report the $E_{{in}}$ of the optimal decision stump returned by your program.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decision_stump_multi_dim(X, y):\n",
    "    \"\"\"\n",
    "    Multi-Dimensional Decision Stump Algorithm\n",
    "    Args:\n",
    "        X: 数据\n",
    "        y: 标签\n",
    "    Returns:\n",
    "        best: (dim, s, theta, err_in) 元组\n",
    "    \"\"\"    \n",
    "    dim = len(X[0])\n",
    "    \n",
    "    best_param = (0,0,0,np.inf)\n",
    "    for d in range(dim):\n",
    "        s, theta, err_in = decision_stump_1d(X[:,d], y)\n",
    "        #result.append((d, s, theta, err_in))\n",
    "        if err_in < best_param[3]:\n",
    "            best_param = (d, s, theta, err_in)\n",
    "    return best_param"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(100, 10)\n",
      "[[  8.105  -3.5     4.769   4.541  -9.829   5.252   3.838  -3.408  -4.824\n",
      "   -1.   ]\n",
      " [ -6.273  -2.097   9.404   1.143   3.487  -5.206   0.061   5.024  -6.687\n",
      "    1.   ]\n",
      " [  1.624  -1.173   4.26   -3.607  -6.632   4.431  -8.355   7.206  -8.977\n",
      "    1.   ]\n",
      " [-10.      7.758  -2.67   -8.88   -1.099  -9.183  -4.086   8.962   5.841\n",
      "    1.   ]\n",
      " [  8.464   1.762   2.729   2.724   8.155   6.096  -2.844   9.8     3.302\n",
      "   -1.   ]]\n",
      "(1000, 10)\n",
      "[[ 0.531 -1.884 -0.351 -1.796 -9.891  6.12   2.486  8.44  -5.123 -1.   ]\n",
      " [ 5.123  5.047  5.404 -1.742 -0.317  9.585 -4.016 -1.8   -5.633  1.   ]\n",
      " [ 3.286  4.251 -4.837 -7.065 -7.546 -4.727  9.055  4.941 -6.287  1.   ]\n",
      " [-0.795 -1.617 -8.414 -5.391  6.641  1.269 -5.806 -7.375  9.469  1.   ]\n",
      " [-4.362  1.49  -7.232  0.802  4.424 -4.777  6.075  3.48  -9.837  1.   ]]\n"
     ]
    }
   ],
   "source": [
    "data_train = read_data_from_file('hw2_train.dat')\n",
    "data_test = read_data_from_file('hw2_test.dat')\n",
    "print(data_train.shape)\n",
    "print(data_train[:5])\n",
    "print(data_test.shape)\n",
    "print(data_test[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3, -1, 1.6175000000000002, 0.25)\n"
     ]
    }
   ],
   "source": [
    "h_best = decision_stump_multi_dim(data_train[:,:-1], data_train[:,-1])\n",
    "print(h_best)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "20. Use the returned decision stump to predict the label of each example within $D_{test}$. Report an estimate of $E_{out}$ by $E_{test}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cal_eout(X_test, y_test, i, s, theta):\n",
    "    h = lambda i, s, theta, X: s * sign(X[:,i] - theta)\n",
    "    return (h(i, s, theta, X_test) != y_test).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Etest =  0.355\n"
     ]
    }
   ],
   "source": [
    "print('Etest = ', cal_eout(data_test[:,:-1], data_test[:,-1], h_best[0], h_best[1], h_best[2]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
